Real-Time Document Cluster Analysis for Dynamic Data Sets
Abstract
One of the most challenging analysis problems in the data mining and information retrieval domains is organizing large amounts of information. In this paper, we present a fast agglomerative clustering technique used in the Virtual Information Processing Agent Research (VIPAR) project at the Oak Ridge National Laboratory. This approach extends the Vector Space Model (VSM) to provide near real-time clustering of a moderately large and dynamic set of text documents. In the traditional VSM, each document added to or removed from the document set requires a computationally expensive set of operations to be performed before analysis can resume. While this prior calculation of all of the VSM values is feasible in some problem domains, it is computationally prohibitive for near real-time operation with large, dynamic sets of documents. We present a method to quickly and accurately update the VSM values in an environment where articles are being continuously added and removed. We conducted a series of experiments and, based on the results, implemented a strategy that provides dynamic updates to the VSM values while preserving high accuracy. This approach allows documents to be quickly added to and removed from the document set, providing accurate agglomerative clustering and enabling real-time document analysis.

Introduction

One of the most challenging analysis problems in the data mining and information retrieval domains is organizing large amounts of information. One approach to this problem is to cluster information based on the content of a collection of documents. One widely used technique is to represent documents as vectors in a Vector Space Model (VSM), compare those documents in a dissimilarity matrix, and use agglomerative clustering to represent the document comparisons in a dendrogram.
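The traditional pipeline described above (document vectors, a dissimilarity matrix, agglomerative clustering) can be sketched as follows. This is a minimal illustration, not the VIPAR implementation: the tf-idf weighting, the cosine dissimilarity, the average-linkage rule, the sample documents, and all function names are assumptions chosen for the example. The last lines show why a static VSM is costly to maintain: adding one document changes the collection statistics, so weights in existing vectors change too.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df).

    The weighting scheme is an assumption for this sketch.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_dissimilarity(u, v):
    """1 - cosine similarity of two sparse vectors (1.0 if either is all zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dissimilarity_matrix(vectors):
    """Full pairwise dissimilarity matrix as a list of lists."""
    n = len(vectors)
    return [[cosine_dissimilarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


def agglomerate(dist):
    """Average-linkage agglomerative clustering.

    Returns the merge history [(cluster_a, cluster_b, distance), ...],
    which is the information a dendrogram displays. Leaves are numbered
    0..n-1 and each merge creates the next unused cluster id.
    """
    clusters = {i: [i] for i in range(len(dist))}
    next_id = len(dist)
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Hypothetical sample documents.
docs = [
    "oil prices rise sharply",
    "oil prices fall again",
    "team wins football match",
    "football team loses match",
]
merges = agglomerate(dissimilarity_matrix(tfidf_vectors(docs)))
for a, b, d in merges:
    print(f"merge {a} + {b} at dissimilarity {d:.3f}")

# Adding one document changes N and the document frequencies, so the
# weight of "oil" in the FIRST document changes as well.
before = tfidf_vectors(docs)[0]["oil"]
after = tfidf_vectors(docs + ["oil exports grow"])[0]["oil"]
print(f"weight of 'oil' in document 0: {before:.3f} -> {after:.3f}")
```

In this sketch every insertion or removal invalidates the existing weights and, with them, the dissimilarity matrix, which is the recomputation cost the abstract refers to.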
The Virtual Information Processing Agent Research (VIPAR) project at the Oak Ridge National Laboratory makes significant use of this clustering technique with the goal of enabling information from a number of Internet media sources to be integrated, then rapidly searched, clustered, analyzed, and visually presented to an analyst for improved decision making. In VIPAR, thousands of articles from Internet newspapers come streaming into the VIPAR system throughout the day in an asynchronous manner. These articles must be immediately placed into article clusters and provided, in near real time, to analysts using VIPAR. However, these requirements of dynamic datasets and real-time responsiveness were contradictory. While the VSM-based agglomerative clustering approach has demonstrated value in organizing

* Oak Ridge National Laboratory is managed by UT-Battelle, LLC. The submitted manuscript has been authored by a contractor of the U.S. Government under contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
Similar resources
Dynamic Hierarchical Compact Clustering Algorithm
In this paper we introduce a general framework for hierarchical clustering that deals with both static and dynamic data sets. From this framework, different hierarchical agglomerative algorithms can be obtained, by specifying an inter-cluster similarity measure, a subgraph of the β-similarity graph, and a cover algorithm. A new clustering algorithm called Hierarchical Compact Algorithm and its ...
A Layout-Analysis Based System for Document Image Retrieval
G. Pirlo, M. Chimienti, M. Dassisti, D. Impedovo, A. Galiano. Abstract. This paper presents a new system for document image retrieval based on layout analysis. The system, which is well suited for commercial form retrieval, uses the Radon Transform for layout description and Dynamic Time Warping for document image matching. The experimental results, that were cond...
Analysis of Document Clustering using Pseudo Dynamic Quantum Clustering Approach
Abstract. In fields of information processing such as data mining, information retrieval, natural language processing, and machine learning, quantum computing plays a vital role in extracting the implicit, potentially useful, and previously unknown information f...
A Bio-inspired Clustering Approach for Dynamic Document Distributed Analysis
Document clustering is a fundamental operation used in unsupervised document organization, automatic topic extraction, and information retrieval. But most clustering technologies are limited in their application to static document collections. Intelligence analysts are currently overwhelmed by the tremendous amount of text information streams generated every day. There is a lack of comprehensive...
Dependent nonparametric trees for dynamic hierarchical clustering
Hierarchical clustering methods offer an intuitive and powerful way to model a wide variety of data sets. However, the assumption of a fixed hierarchy is often overly restrictive when working with data generated over a period of time: We expect both the structure of our hierarchy, and the parameters of the clusters, to evolve with time. In this paper, we present a distribution over collections ...